DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
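Fitted Model Equation: \(\text{Premolt}_i = \beta_0 + \beta_1 \times \text{Postmolt}_i + \varepsilon_i\), where \(\varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon)\)
Fitted Model Equation: \(\text{Score}_i = \beta_0 + \beta_1 \times \text{Submit}_i + \varepsilon_i\), where \(\varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon)\)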
We have fitted the best-fit line describing the average response with an explanatory variable
What if we wanted to infer plausible values of the average response and of the response itself?
This is known as making predictions with your fitted model, which is why the explanatory variable is often called a predictor
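As a minimal sketch of a point prediction, the code below plugs explanatory values into the fitted line, using the coefficient estimates printed in the summary output for the crabs model in these notes. The helper name predict_premolt and the two postmolt sizes are illustrative assumptions, not part of the course code.

```r
# Coefficient estimates copied from the printed summary of the crabs model
b0 <- -28.94743  # estimated intercept
b1 <- 1.09948    # estimated slope for Postmolt

# Hypothetical helper: estimated average premolt size (mm) at a postmolt size (mm)
predict_premolt <- function(postmolt) b0 + b1 * postmolt

# Estimated average premolt sizes at postmolt sizes of 125 mm and 135 mm
predict_premolt(c(125, 135))
```

With the fitted model object available, predict(crabs.fit, newdata = data.frame(Postmolt = c(125, 135))) should agree with these values up to the rounding of the printed coefficients.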
Multiple \(R^2\) describes the proportion of the total variability of the response variable that is explained by the fitted model
Analysis of Variance Table
Response: Premolt
           Df Sum Sq Mean Sq F value    Pr(>F)
Postmolt    1  41393   41393   10389 < 2.2e-16 ***
Residuals 359   1430       4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Premolt ~ Postmolt, data = crabs.df)
Residuals:
    Min      1Q  Median      3Q     Max
-4.6485 -1.3060  0.0829  1.2683 11.0291
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.94743    1.55480  -18.62   <2e-16 ***
Postmolt      1.09948    0.01079  101.93   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.996 on 359 degrees of freedom
Multiple R-squared: 0.9666, Adjusted R-squared: 0.9665
F-statistic: 1.039e+04 on 1 and 359 DF, p-value: < 2.2e-16
Our fitted (simple linear regression) model explains 96.7% of the variability in the premolt sizes of female Dungeness crabs
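The arithmetic behind Multiple \(R^2\) can be sketched from the Analysis of Variance table: it is the sum of squares explained by the model divided by the total sum of squares. The variable names below are illustrative, and the sums of squares are the rounded values printed in the table, so the result only matches the reported 0.9666 up to rounding.

```r
# Sums of squares copied from the Analysis of Variance table for the crabs model
ss.model <- 41393     # Sum Sq for Postmolt (variability explained by the model)
ss.residual <- 1430   # Sum Sq for Residuals (variability left unexplained)

# Total variability of the response is the sum of the two pieces
ss.total <- ss.model + ss.residual

# Proportion of total variability explained by the fitted model
r.squared <- ss.model / ss.total
round(r.squared, 4)
```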
Is there an “optimal” value for Multiple \(R^2\)?
A confidence interval for the mean response is an interval that can capture the “true” mean response for all elements of the “population” at a specific explanatory variable value (based on the fitted linear regression model)
Two sources of uncertainty—\(\beta_0\) and \(\beta_1\)
For female Dungeness crabs whose postmolt size is 145 mm, we estimate with 95% confidence that their average premolt size is somewhere between 130.3 and 130.7 mm
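A minimal sketch of how such an interval can be built by hand and checked against predict(). Since crabs.df is not reproduced in these notes, the code uses simulated stand-in data; sim.df, sim.fit, the seed, and the simulation settings are all assumptions made for illustration, so the numbers it produces are not the course results.

```r
# Simulated stand-in data (hypothetical; NOT the course's crabs.df)
set.seed(121)
sim.df <- data.frame(Postmolt = runif(200, min = 100, max = 180))
sim.df$Premolt <- -29 + 1.1 * sim.df$Postmolt + rnorm(200, mean = 0, sd = 2)
sim.fit <- lm(Premolt ~ Postmolt, data = sim.df)

x0 <- 145                 # explanatory variable value of interest
n <- nrow(sim.df)
s <- sigma(sim.fit)       # residual standard error
sxx <- sum((sim.df$Postmolt - mean(sim.df$Postmolt))^2)

# Estimated mean response at x0 and its standard error,
# which reflects uncertainty in the intercept and slope estimates only
fit0 <- sum(coef(sim.fit) * c(1, x0))
se.mean <- s * sqrt(1 / n + (x0 - mean(sim.df$Postmolt))^2 / sxx)

# 95% confidence interval for the mean response, by hand
t.mult <- qt(0.975, df = df.residual(sim.fit))
ci.hand <- fit0 + c(-1, 1) * t.mult * se.mean

# The same interval from predict()
ci.predict <- predict(sim.fit, newdata = data.frame(Postmolt = x0),
                      interval = "confidence", level = 0.95)
```

Here ci.hand and the lwr and upr columns of ci.predict should agree, which shows that the interval is the estimate plus or minus a t-multiplier times the standard error of the estimated mean.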
We can leverage the emmeans package to do the mathematics for us…
library(emmeans)
emmeans(crabs.fit, ~ Postmolt, at = list(Postmolt = c(125, 135, 145))) |>
  pairs(infer = c(TRUE, FALSE)) # Just to "chop-off" p-values for this example
 contrast                  estimate    SE  df lower.CL upper.CL
 Postmolt125 - Postmolt135      -11 0.108 359    -11.2    -10.7
 Postmolt125 - Postmolt145      -22 0.216 359    -22.5    -21.5
 Postmolt135 - Postmolt145      -11 0.108 359    -11.2    -10.7
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
We are 95% confident that the mean premolt size of female Dungeness crabs whose postmolt size is 125 mm is somewhere between 10.7 and 11.2 mm lower than that of crabs whose postmolt size is 135 mm
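The first contrast can be roughly reconstructed from the printed slope alone, because the intercept cancels in a difference of two mean responses: \((\beta_0 + 125\beta_1) - (\beta_0 + 135\beta_1) = -10\beta_1\). The sketch below uses the slope estimate and standard error from the summary output, and assumes the Tukey adjustment uses the studentized-range quantile divided by \(\sqrt{2}\) as its multiplier; variable names are illustrative.

```r
# Slope estimate, its standard error, and residual df from the printed summary
b1 <- 1.09948
se.b1 <- 0.01079
df.resid <- 359

# Difference in postmolt size between the two groups being compared (mm)
diff.x <- 125 - 135

# Difference in mean premolt size and its standard error
estimate <- diff.x * b1          # close to the printed -11
se.diff <- abs(diff.x) * se.b1   # close to the printed 0.108

# Assumed Tukey multiplier for a family of 3 estimates
crit <- qtukey(0.95, nmeans = 3, df = df.resid) / sqrt(2)

# Adjusted 95% confidence interval for the difference
ci.diff <- estimate + c(-1, 1) * crit * se.diff
round(ci.diff, 1)
```

The limits should land close to the lower.CL and upper.CL printed for Postmolt125 - Postmolt135, with small differences possible because the inputs are rounded.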
A prediction interval is an interval that can capture a plausible response value for a specific observation in the “population” at a specific explanatory variable value (based on the fitted linear regression model)
Three sources of uncertainty—\(\beta_0\), \(\beta_1\), and \(\sigma_\varepsilon\)
We predict with 95% confidence that a female Dungeness crab’s premolt size is somewhere between 126.5 and 134.4 mm if its postmolt size is 145 mm
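A minimal sketch of why a prediction interval is wider than the matching confidence interval: the residual variance is added to the variance of the estimated mean. As crabs.df is not reproduced in these notes, the code again uses simulated stand-in data; sim.df, sim.fit, the seed, and the simulation settings are assumptions for illustration only.

```r
# Simulated stand-in data (hypothetical; NOT the course's crabs.df)
set.seed(121)
sim.df <- data.frame(Postmolt = runif(200, min = 100, max = 180))
sim.df$Premolt <- -29 + 1.1 * sim.df$Postmolt + rnorm(200, mean = 0, sd = 2)
sim.fit <- lm(Premolt ~ Postmolt, data = sim.df)

new.df <- data.frame(Postmolt = 145)

# Both intervals at the same explanatory variable value
conf.int <- predict(sim.fit, newdata = new.df, interval = "confidence")
pred.int <- predict(sim.fit, newdata = new.df, interval = "prediction")

# By hand: standard error for a new observation combines the uncertainty
# in the estimated mean with the residual standard error
se.mean <- predict(sim.fit, newdata = new.df, se.fit = TRUE)$se.fit
se.pred <- sqrt(se.mean^2 + sigma(sim.fit)^2)
t.mult <- qt(0.975, df = df.residual(sim.fit))
pred.hand <- conf.int[1, "fit"] + c(-1, 1) * t.mult * se.pred

# Interval widths for comparison
width.conf <- conf.int[1, "upr"] - conf.int[1, "lwr"]
width.pred <- pred.int[1, "upr"] - pred.int[1, "lwr"]
```

Comparing width.pred with width.conf shows the extra spread that comes from \(\sigma_\varepsilon\), the third source of uncertainty.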
For crabs.df:
# Scatter plot of the data; semi-transparent points show where observations overlap
plot(Premolt ~ Postmolt, data = crabs.df, main = "Premolt vs postmolt sizes of Dungeness crabs",
xlab = "Postmolt size (mm)", ylab = "Premolt size (mm)", col = rgb(0, 0, 0, alpha = 0.25))
crabs.fit <- lm(Premolt ~ Postmolt, data = crabs.df)
# Grid of postmolt sizes at which both intervals are evaluated
x <- seq(100, 180, length.out = 100)
y <- predict(crabs.fit, newdata = data.frame(Postmolt = x), interval = "prediction")
yhat <- predict(crabs.fit, newdata = data.frame(Postmolt = x), interval = "confidence")
# Fitted line (solid red), prediction limits (dashed blue), confidence limits (dashed red)
abline(crabs.fit, col = "red")
lines(x = x, y = y[, 2], col = "blue", lty = 5)
lines(x = x, y = y[, 3], col = "blue", lty = 5)
lines(x = x, y = yhat[, 2], col = "red", lty = 5)
lines(x = x, y = yhat[, 3], col = "red", lty = 5)
For submit.df:
# Scatter plot of the data; semi-transparent points show where observations overlap
plot(Score ~ Submit, data = submit.df, main = "Exam marks vs final submission times of students", xlim = c(0, 24),
xlab = "Final submission time (hours)", ylab = "Exam mark (out of 80)", col = rgb(0, 0, 0, alpha = 0.25), xaxs = "i")
submit.fit <- lm(Score ~ Submit, data = submit.df)
# Grid of submission times at which both intervals are evaluated
x <- seq(0, 24, length.out = 100)
y <- predict(submit.fit, newdata = data.frame(Submit = x), interval = "prediction")
yhat <- predict(submit.fit, newdata = data.frame(Submit = x), interval = "confidence")
# Fitted line (solid red), prediction limits (dashed blue), confidence limits (dashed red)
abline(submit.fit, col = "red")
lines(x = x, y = y[, 2], col = "blue", lty = 5)
lines(x = x, y = y[, 3], col = "blue", lty = 5)
lines(x = x, y = yhat[, 2], col = "red", lty = 5)
lines(x = x, y = yhat[, 3], col = "red", lty = 5)